Abstract

This paper summarises the current state of the
art in the study of compositionality in
distributional semantics, and major challenges for this
area. We single out generalised quantifiers and 
intensional semantics as areas on which to fo- 
cus attention for the development of the the- 
ory. Once suitable theories have been devel- 
oped, algorithms will be needed to apply the 
theory to tasks. Evaluation is a major prob- 
lem; we single out application to recognising 
textual entailment and machine translation for 
this purpose. 



1 Introduction 

This paper summarises some major challenges for 
the nascent field of distributional compositional se- 
mantics. Research in this area has arisen out of the 
success of vector-based techniques for representing 
aspects of lexical semantics, such as latent
semantic analysis (Deerwester et al., 1990) and measures
of distributional similarity (Lin, 1998; Lee, 1999).

The automatic nature of these techniques means
that much higher coverage can be achieved
compared to manually constructed resources such as
WordNet (Fellbaum, 2005). Additionally, the
vector-based nature of the semantic representations
allows fine-grained aspects of meaning to be
incorporated, in contrast to the type of relations
typically expressed in ontologies. Moreover, the
construction of an ontology is generally a subjective
process, whereas vector-based approaches are
typically more objective, being formed from
observations of the contexts in which words occur in
large corpora. There are disadvantages: automatic
techniques are arguably less reliable than manually
constructed resources, and often do not explicitly
identify the variety of relationships between words
that are captured in an ontology such as WordNet.

Researchers have begun to look at how such tech- 
niques can be extended beyond the word level to 
represent meanings of phrases and even whole sen- 
tences. Existing techniques cannot be applied di- 
rectly beyond phrases of two or three words because 
of the problem of data sparseness — as the length 
of the phrase increases, the amount of data matching
the phrase falls off very quickly, and soon there is 
not enough data to build vectors reliably. The alter- 
native is to look at how to compose such vectors, so 
that the vector for a phrase or sentence is determined 
purely by the vector representations for the individ- 
ual words in the sentence. 

While interest in this area has exploded in re- 
cent years, and some significant advances have been 
made, there is still a lot of work to do: 

• The underlying theory needs to be developed 
to allow distributional approaches to describe 
aspects of natural language meaning easily de- 
scribed by model-theoretic semantics, for ex- 
ample, generalised quantifiers and intensional 
semantics. We explain below why current ap- 
proaches are not suited to either of these. 

• New algorithms and tools are needed to per- 
form inference with the new theories. 

• We need suitable methods for evaluating dis- 
tributional models of compositionality. In ad- 
dition, approaches need to be evaluated across 
a broader range of natural language processing 
tasks. In particular we identify textual
entailment and machine translation as suitable areas
for application of current and future techniques.

In the remainder of the paper, we summarise
existing work (Section 2), then motivate each of the
above areas in detail (Section 3).

2 Background 

Vector representations provide a rich variety
of possible methods of composition. The
most obvious method is perhaps vector addition
(Landauer and Dumais, 1997; Foltz et al., 1998), in
which a string of words is represented by the
sum of the individual words making up the string.
This method has several problems, the most
obvious of which is that the operation is
commutative, whereas natural language meaning is not:
John hit Mary does not mean the same as Mary
hit John. Another composition operation that
suffers from this problem is point-wise multiplication
(Mitchell and Lapata, 2008).
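
The commutativity problem can be seen in a small sketch. The
three-dimensional vectors below are made-up values chosen purely for
illustration, and the helper names add and multiply are hypothetical;
they are not taken from the cited work.

```python
# Toy context vectors; the numbers are invented for illustration only.
john = [1.0, 2.0, 2.0]
hit = [2.0, 3.0, 1.0]
mary = [2.0, 1.0, 4.0]


def add(*vectors):
    """Compose by summing the word vectors component by component."""
    return [sum(components) for components in zip(*vectors)]


def multiply(*vectors):
    """Compose by multiplying the word vectors component by component."""
    result = [1.0] * len(vectors[0])
    for vector in vectors:
        result = [r * x for r, x in zip(result, vector)]
    return result


# Both operations ignore word order, so "John hit Mary" and
# "Mary hit John" receive exactly the same vector.
print(add(john, hit, mary) == add(mary, hit, john))            # True
print(multiply(john, hit, mary) == multiply(mary, hit, john))  # True
```

Any composition function that is symmetric in its arguments will show
the same behaviour, whatever the underlying vectors are.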

A method of composing vectors that avoids
this issue is the tensor product (Smolensky, 1990;
Clark and Pulman, 2007; Widdows, 2008). Given
two vectors $u$ and $v$ in vector spaces $U$ and $V$ of
dimensionality $m$ and $n$ respectively, the tensor
product $u \otimes v$ is a vector in a much larger space
$U \otimes V$ of dimensionality $mn$. Each pair of basis
vectors in $U$ and $V$ has a corresponding basis vector
in $U \otimes V$, so given a tensor product $u \otimes v$ it is
always possible to deduce the original vectors $u$
and $v$ (up to a scalar factor), another property that
is missing from vector addition.
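
As a rough sketch of these properties, the tensor product of two
vectors can be computed as a flattened outer product. The function
name tensor and the example values are assumptions made for
illustration.

```python
def tensor(u, v):
    """Tensor (outer) product of u and v, flattened to length m * n."""
    return [a * b for a in u for b in v]


u = [1.0, 2.0]        # m = 2
v = [3.0, 0.0, 1.0]   # n = 3

uv = tensor(u, v)
vu = tensor(v, u)

print(len(uv))    # 6, that is m * n
print(uv == vu)   # False: the operation is sensitive to order

# Each block of n entries is a scalar multiple of v, which is why the
# factors can be read back off the product (up to a scalar factor).
n = len(v)
rows = [uv[i * n:(i + 1) * n] for i in range(len(u))]
print(rows[0] == [u[0] * x for x in v])   # True
```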

The problem with the tensor product is that strings 
of different lengths have different dimensionalities 
and live in different vector spaces and are thus not 
directly comparable. This means that we cannot say 
to what extent big dog entails dog. There are several 
ways to get around this: 

• Use some linear map from the tensor prod- 
uct space to the original space to reduce 
the dimensionality of vectors and allow them 
to be compared. This was suggested by 
Mitchell and Lapata (2008) as a general
"multiplicative model" of composition. The
problem with this method is that information is lost
as meanings compose since all strings have the 
same dimensionality. 



• Impose relations on different tensor
powers of the space to make them comparable
(Clarke et al., 2010). This approach allows a
lot of flexibility in describing composition but
it is not clear how to determine what relations
should be imposed, nor how we can easily
compute with the resulting structures. It does,
however, resolve the problem of information loss as
strings are composed.
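
The first option in the list above can be illustrated with one
particular linear map, chosen here only because it is easy to write
down: keeping the diagonal entries of the tensor product, which
recovers point-wise multiplication. The names tensor and diagonal_map
are hypothetical, and other linear maps are equally possible.

```python
def tensor(u, v):
    """Tensor (outer) product of u and v, flattened to length n * n."""
    return [a * b for a in u for b in v]


def diagonal_map(t, n):
    """A linear map from the n * n tensor space back to n dimensions.

    It keeps only the entries u[i] * v[i] and discards the rest.
    """
    return [t[i * n + i] for i in range(n)]


u = [1, 2, 3]
v = [4, 5, 6]

# The two tensor products differ, so word order is still visible...
print(tensor(u, v) == tensor(v, u))   # False

# ...but after the map they coincide: information has been lost.
print(diagonal_map(tensor(u, v), 3))  # [4, 10, 18]
print(diagonal_map(tensor(v, u), 3))  # [4, 10, 18]
```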



The approach of Grefenstette et al. (2011) is
inspired by some mathematical similarities between
the structure of vector spaces and that of pregroup
grammars: they are both compact closed categories.
Their approach can be viewed as a vectorisation of
Montague semantics (Clark et al., 2008).

Other approaches to this problem include the use
of matrices (Rudolph and Giesbrecht, 2010),
including those learnt directly from data
(Baroni and Zamparelli, 2010).



2.1 Context-theoretic Semantics 



The framework of Clarke (2012) is a mathematical
formalisation of the idea that meaning is determined
by context. The structure that is proposed to model
natural language semantics is an associative
algebra over the real numbers $\mathbb{R}$. This is a real
vector space $A$, together with a multiplication which
satisfies a property called bilinearity:

$$a(\alpha b + \beta c) = \alpha ab + \beta ac$$
$$(\alpha a + \beta b)c = \alpha ac + \beta bc$$

for all $a, b, c \in A$ and all $\alpha, \beta \in \mathbb{R}$. It can be
shown that this type of structure generalises all the
approaches we discussed above (Clarke, 2012).
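
To make the definition concrete, the sketch below checks both
bilinearity conditions numerically for one familiar associative
algebra, the 2 x 2 real matrices under matrix multiplication. This
algebra is an assumption chosen for illustration; it is not the
specific construction used in the framework. The helper names matmul
and combo are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def combo(alpha, a, beta, b):
    """The linear combination alpha * a + beta * b of two matrices."""
    return [[alpha * x + beta * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


a = [[1, 2], [0, 1]]
b = [[0, 1], [1, 0]]
c = [[2, 0], [1, 3]]
alpha, beta = 2, -3

# a(alpha b + beta c) = alpha ab + beta ac
left_1 = matmul(a, combo(alpha, b, beta, c))
right_1 = combo(alpha, matmul(a, b), beta, matmul(a, c))

# (alpha a + beta b)c = alpha ac + beta bc
left_2 = matmul(combo(alpha, a, beta, b), c)
right_2 = combo(alpha, matmul(a, c), beta, matmul(b, c))

print(left_1 == right_1)  # True
print(left_2 == right_2)  # True
```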

Clarke (2012) also proposes a principle to
determine entailment between strings in distributional
semantics, based on the concept of distributional
generality (Weeds et al., 2004), that terms that have
a more general meaning will occur in a wider range
of contexts. The theory assumes the existence of
a distinguished basis which can be interpreted as
defining the contexts in which strings can appear.
This defines a partial ordering on the vector space
by $u \le v$ if and only if every component of $u$ is
less than or equal to the corresponding component
of $v$. The partial ordering is interpreted as
entailment and is connected with distributional generality
since $\hat{x} \le \hat{y}$ if y occurs at least as frequently as x in
every context, where $\hat{x}$ and $\hat{y}$ are the vectors
associated with terms x and y.
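
The ordering can be sketched in a few lines. The context counts below
are invented for illustration, and the function name entails is a
hypothetical label for the component-wise comparison described above.

```python
def entails(u, v):
    """True if every component of u is <= the matching component of v."""
    return len(u) == len(v) and all(a <= b for a, b in zip(u, v))


# Made-up occurrence counts in three contexts.
cat = [2, 0, 1]
animal = [5, 3, 1]
car = [0, 4, 0]

print(entails(cat, animal))   # True: animal is at least as frequent everywhere
print(entails(animal, cat))   # False
# The ordering is only partial: cat and car are incomparable.
print(entails(cat, car), entails(car, cat))   # False False
```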

3 Challenges 
3.1 Theory 

The greatest problem currently facing attempts to 
describe meaning using vectors is to reconcile them 
with existing theories of meaning, most notably log- 
ical approaches to semantics. If distributional se- 
mantics is to replace logical semantics, it has to en- 
compass it, since there are things that logical seman- 
tics does very well that it is hard to imagine distri- 
butional semantics doing in its current form. For 
example, it is conceivable that an intelligent agent 
could be built which interpreted natural language 
sentences using logic. The agent would choose the
best course of action given a set of assumptions, per- 
haps using a combination of theorem provers, auto- 
mated planning and search tools. The functionality 
provided by the theorem proving component in such 
a system would be essential, allowing diverse pieces 
of knowledge from a variety of sources to be com- 
bined and deductions to be made from them. This 
is something that distributional approaches are not 
currently able to do. 

Encompassing a whole logical semantic formal- 
ism in a manner consistent with distributional se- 
mantics is an ambitious goal. We have identified two 
particular areas with the following characteristics:

• They are intuitively familiar and easy to under- 
stand 

• They occur fairly frequently in ordinary speech 
and writing 

• No existing framework for compositionality in 
distributional semantics deals with them satis- 
factorily 

It is our hope that by concentrating on these areas 
we are able to make progress towards the ultimate 
goal. 

Generalised Quantifiers

The study of generalised quantifiers concerns
expressions such as some, most but not all,
no and at least two. In the analysis of
Barwise and Cooper (1981), which is based on the
earlier work of Montague (1974), the semantics of
a determiner such as these operates on a set of
entities (for example the set of people) and returns
a set of sets. For example, the semantics of most
people is the set of all sets of entities which contain
most people.

Formalising these properties mathematically al- 
lows us to understand some properties of entailment 
between sentences containing such quantifiers. For 
example, all animals breathe entails all cats breathe, 
whereas some cats like cheese entails some animals 
like cheese; the change in quantifier has reversed the 
direction of the entailment. 
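
The reversal can be checked in a small finite model. The sketch below
represents the set of sets returned by a determiner through its
membership test; the entity names, the sets, and the function names
every, some and most are all assumptions made for illustration.

```python
# A tiny model: every cat is an animal.
cats = {"tom", "felix"}
animals = cats | {"rex"}


def every(restrictor):
    """'all N': the sets that contain every member of N."""
    return lambda scope: restrictor <= scope


def some(restrictor):
    """'some N': the sets that contain at least one member of N."""
    return lambda scope: bool(restrictor & scope)


def most(restrictor):
    """'most N': the sets that contain more than half of N."""
    return lambda scope: len(restrictor & scope) > len(restrictor) / 2


breathers = set(animals)
cheese_lovers = {"tom"}

# all animals breathe entails all cats breathe
print(every(animals)(breathers), every(cats)(breathers))      # True True
# some cats like cheese entails some animals like cheese
print(some(cats)(cheese_lovers), some(animals)(cheese_lovers))  # True True

# Neither entailment holds in the opposite direction.
print(every(cats)(cats), every(animals)(cats))    # True False
print(some(animals)({"rex"}), some(cats)({"rex"}))  # True False
```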

This property cannot be captured within the
framework of Clarke (2012), because of the in-built
property of linearity of the multiplication in the
underlying algebra. If we accept the idea of
distributional generality, that cat should entail animal
because the latter will occur in a broader range of
contexts, then it follows from linearity that x cat y will
entail x animal y for any strings x and y. More
generally, for any $u, v, w \in A$ such that $u \le v$, we
have $uw \le vw$ and $wu \le wv$.
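
A minimal numerical sketch of this monotonicity follows. It assumes
point-wise multiplication as the bilinear product and non-negative
vectors (as with frequency counts); both are choices made for
illustration rather than part of the framework, and the vector values
are invented.

```python
def entails(u, v):
    """Component-wise ordering: u <= v in every context."""
    return all(a <= b for a, b in zip(u, v))


def compose(u, v):
    """One bilinear product: point-wise multiplication."""
    return [a * b for a, b in zip(u, v)]


cat = [2, 0, 1]
animal = [5, 3, 1]
w = [1, 2, 0]   # any non-negative vector, standing in for another string

print(entails(cat, animal))                          # True
print(entails(compose(cat, w), compose(animal, w)))  # True
print(entails(compose(w, cat), compose(w, animal)))  # True
```

Whatever non-negative vector is chosen for w, the ordering between cat
and animal is preserved, so a vector for all composed in this way
could not reverse the direction of entailment.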

In fact, what the reversal of entailment indicates
is that quantifiers such as all are non-linear; they
are not compatible with the bilinearity condition of
context-theoretic semantics. This is a problem for all
existing approaches to the problem of
compositionality in distributional semantics, since linearity is a
common assumption among them (Clarke, 2012).

The work of Preller and Sadrzadeh (2011)
addresses the problem of representing negation in
distributional semantics using Bell states. Since
negation results in a similar reversal of entailment, it is
possible that such an approach would also be useful
for modelling generalised quantifiers.

Intensional Semantics 

Intensional semantics deals with certain com- 
plex semantic phenomena such as those involving 
the verbs know, believe, want and need. These 
are described elegantly in Montague semantics
(Montague, 1974), and the ability to reason about
such concepts is essential for intelligent agents that
would interact with humans in natural language.
Reasoning about such sentences requires additional
knowledge about the meaning of these words that
would normally be described in terms of logic; it
is hard to imagine how their meanings could be
described reliably within distributional semantics.

3.2 Algorithms and Tools 

In order to compete with logical methods in
semantics, distributional semantics needs to support
the following operations, given a fixed body of
background knowledge (expressed in natural language):

1. Truth: Estimate the probability that a given 
sentence is true. 

2. Search: Given a parameterised sentence, for 
example the queen was born in x, find the
parameter x which maximises the probability of
the sentence. 

3. Entailment: Given two sentences, compute 
the degree to which the first entails the second. 

The first and second of these will be useful in tasks
such as question answering, while the third will be
useful for any of the tasks associated with textual
entailment (Dagan et al., 2005), for example
information retrieval.

There are more complex tasks that may not be 
expressible in terms of distributional semantics, for 
example those needed in planning for an intelligent 
agent; the exact formulation for such tasks may
depend on the particular semantic formalism
chosen.

When designing algorithms for these tasks, it is 
likely that we will be able to compute the answer 
much faster if we allow an approximation to the 
answer, which may be perfectly suitable for many 
tasks. Without a satisfactory theory of meaning, 
however, it is hard to speculate on the possible na- 
ture of such algorithms. 

3.3 Evaluation Methods 

A problem for researchers working in this field is 
how to evaluate models of compositionality. Re- 
searchers have evaluated models on short phrases 
by determining context vectors for the phrases and 
for individual words directly. They then com- 
pose the vectors for individual words using their 
models to obtain vectors for phrases and measure 
how similar these are to the observed phrase
vectors (Baroni and Zamparelli, 2010; Guevara, 2011).
This evaluation technique cannot be extended
beyond short phrases, however, so may not provide a
good measure of how good models are at handling
deep semantics.
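
The evaluation procedure described above can be sketched as follows.
Cosine similarity is used here as one common similarity measure; the
cited studies may use different measures, and all vectors below are
invented rather than estimated from a corpus.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical context vectors for two words and for the phrase
# "big dog" as it would be observed directly in a corpus.
big = [1.0, 2.0, 0.0, 1.0]
dog = [3.0, 1.0, 2.0, 0.0]
observed_big_dog = [3.0, 2.0, 1.0, 1.0]

# Two candidate composition models applied to the word vectors.
additive = [a + b for a, b in zip(big, dog)]
multiplicative = [a * b for a, b in zip(big, dog)]

# A model is judged by how close its output is to the observed vector.
for name, composed in [("additive", additive),
                       ("multiplicative", multiplicative)]:
    print(name, round(cosine(composed, observed_big_dog), 3))
```

In practice the comparison would be averaged over many phrases, and it
relies on the phrase being frequent enough to have a reliable observed
vector, which is why it does not scale to longer strings.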

The recent Workshop on Distributional
Semantics and Compositionality
(Biemann and Giesbrecht, 2011) provided a dataset
and a shared task of determining to what degree 
a phrase is compositional. This is undoubtedly a 
useful task, but again does not address the question 
of deep semantics. 

In order to evaluate deep semantics, we propose 
applying methods to two tasks requiring deep se- 
mantics to perform well: recognising textual en- 
tailment and machine translation. We believe these 
tasks are suitable for this purpose because they 
would intuitively seem to require deep semantics 
to achieve perfect performance, yet statistical ap- 
proaches are able to achieve reasonable to good per- 
formance. These tasks would thus provide a testing 
ground in which the sophistication of the techniques 
applied can be increased gradually towards deep se- 
mantics, the hope being that the more sophisticated 
techniques will lead to improved performance. 

4 Conclusion 

We have summarised some approaches to mod- 
elling compositionality in distributional semantics, 
and highlighted some challenges which we believe 
to be pertinent. In particular, we identified some as- 
pects of the theory of distributional semantics which 
we believe to be lacking; anyone able to resolve 
these will necessarily push the boundaries of our un- 
derstanding of meaning. 